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When two variables are related by a known function, the coefficient of deter- 
00 mination (denoted R 2 ) measures the proportion of the total variance in the obser- 

CN vations that is explained by that function. This quantifies the strength of the rela- 

tionship between variables by describing what proportion of the variance is signal 
as opposed to noise. For linear relationships, this is equal to the square of the cor- 
relation coefficient, p. When the parametric form of the relationship is unknown, 
however, it is unclear how to estimate the proportion of explained variance equi- 
tably ( Reshef et al.| |201 1) - assigning similar values to equally noisy relationships. 
• • Here we demonstrate how to directly estimate a generalized R 2 when the form of 

. ^ the relationship is unknown, and we question the performance of the Maximal In- 

^ formation Coefficient (MIC) - a recently proposed information theoretic measure 

of dependence ( jReshef et al.||2011) . We show that our approach behaves equitably, 



o 



has more power than MIC to detect association between variables, and converges 
faster with increasing sample size. Most importantly, our approach generalizes 
to higher dimensions, which allows us to estimate the strength of multivariate re- 
lationships (Y against A, B, . . .) and to measure association while controlling for 
covariates (Y against X controlling for C), whose importance was highlighted in 
Speedl ( |20ri) . 



In Reshef et al. (2011 1, the authors describe desired properties of a measure of 
bivariate association: generality and equitability. A measure that is general will dis- 
cover, with sufficient sample size, any departure from independence, while a measure 
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Figure 1: Illustrating the generalized R 2 . Panel A: Data is normally distributed 
about the alternative model - the white regression line Y = sin(X). The null model 
is the blue Y = 0. Marginal distributions of X and Y are represented above and 
to the right. The classical R 2 is calculated using deviations of the samples from the 
blue and white lines. Panel B depicts the probability distribution over Xi, yi for the 
alternative (red) and null (blue) models. Panel C shows the height of the observations 
on the alternative distribution, relative to the null distribution. The generalized R 2 is 
calculated from the ratio of these heights, and does not require an explicit regression 
line (white), which is included only as a guide for the eye. 



that is equitable will assign similar scores to equally noisy relationships of different 
kinds. A further attractive property is that a measure should scale like the coefficient 
of determination (i? 2 ): the proportion of variance explained. 

Reshef et al. ( 201 \) demonstrates that other measures of association (including 



Spearman's rank correlation, mutual information, maximal correlation and principal 
curve-based correlation) are not equitable; different functional forms with similar amounts 
of noise can produce vastly different estimates of association strength. Here we show 
that generality and equability can be achieved by estimating a generalized R 2 through 
density approximation. 

First consider a function with additive noise, y = f(x) + Af. The coefficient 
of determination is the proportion of variance in y "explained" by the deterministic 
component f(x) relative to the total variance in y, which is inflated by unexplained 
stochastic noise, Af. This notion of variance is defined in terms of average squared 
deviations - the distance between a data point and a model. The explained variance, R 2 , 
is 1 — & Error / a Totai> wnere the total variance (<J^ otal ) is the average squared deviation 
from a flat "null" model and the error variance (<J Error ) is the average squared deviation 
from f(x), the "alternative" model. 

Least squares regression assumes that observations are normally distributed about 
the explanatory function. The deviation of a point from the regression line can thus 



be expressed as a probability density, and R has an equivalent form ( |Cox and Snell 



1989;Magee 1990 Nagelkerke 1991): 
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P 2 , TT /~ P{xj,yi\null) \ 2/n 

R =1 -ll{p{ Xi , yi \alt)) (1) 

This formulation of R 2 asserts that the proportion of unexplained variance is the ge- 
ometric mean of the squared ratio of the probability of observing a data point under 
the null model over the probability of that data point under the alternative model. The 
explained variance is 1 minus the unexplained proportion. See figure 1 for a visual 
depiction. 

Since this R 2 now depends only on the probability density ratio between two mod- 
els, it is applicable even when the assumptions behind least squares regression are 
violated. This is a powerful rethinking of R 2 . The idea of "explained variance" is 
generalized away from the restrictive assumptions of normally distributed noise, and, 
most importantly, the very notion of a regression curve is no longer required. This gen- 
eralized R 2 can be calculated as long as the probability distributions for the null and 
alternative models can be evaluated. 

We base our measure of dependence between variables upon this generalized R 2 . 
Even when a known distribution generates our data, we still need to specify the null dis- 
tribution before R 2 can be computed, but this generalized definition of R 2 is agnostic 
about a choice of null model. An attractive property for a measure of dependence is that 
it is if and only if X and Y are independent. A sensible choice of null model is thus 
where P(X, Y) = P(X)P(Y), enforcing independence. Since explicitly choosing a 
null distribution places a restriction on the generalized R 2 , we distinguish our measure 
of association, calling it A. Classical R 2 from least squares regression assumes a dif- 
ferent choice of null model (a constant function with normally distributed errors), so A 
can be thought of as a sister to classical R 2 . They are equivalent for bivariate Gaussian 
distributions, where the marginals are also normally distributed, but will differ (see 
SI1) when the null model for classical R 2 - a constant function with Gaussian errors - 
is a particularly bad fit. A also has an information theoretic interpretation: for known 
distributions it is a sample estimate that converges to Linfoot's 'Informational Measure 



of Correlation' (Linfoot 1957 1 when the number of observations tends to infinity (see 
SI2). 

So far, the computation of A requires a known distribution. Estimating A m A for 
a number of observations from an unknown distribution thus reduces to the problem 
of estimating the density at each point for an independent null and (potentially) depen- 



dent alternative model. We adopt a kernel density approach (Rosenblatt 1956 Parzen 



1962), where the density of the distribution at each point is approximated by the sum 



over a number of Gaussian 'kernel' distributions centered at nearby points (see figure 
[2]). For the null model, we constrain the joint density to be the product of estimates 
of the marginal densities, enforcing independence. We wish to constrain A to vary 
between and 1, so we cannot allow the null to outperform the alternative model, lest 
A become negative. We thus define the density of the alternative model at each sample 
point to be a weighted mixture of dependent (full joint) and independent (product of 
marginal) models, with a single mixture parameter controlling the proportion for all 
points. Thus the alternative model can reduce to the null model as a special case, en- 
suring non-negativity. We estimate the model parameters - and thus the densities - by 
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Null (Y is independent of X) 



Figure 2: Estimating an unknown distribution. The distribution for the alternative 
model (red - where X can depend on Y) is constructed by adding two dimensional 
Gaussian "kernel" distributions centered at each observation. As more of these ker- 
nels are added, the distribution comes to resemble the true distribution from which the 
observations are sampled. We can use a similar approach to estimating a null model 
that expressly disallows any dependence between X and Y (blue) by constructing one 
dimensional marginal distributions (the blue lines to either side) by summing one di- 
mensional Gaussian kernels, and then creating the joint distribution as the product of 
these estimated marginals. 



maximizing the cross-validation likelihood (see Methods). 

Figure pi demonstrates that A is approximately equitable across a number of rela- 
tionships (see SB for details), and is in greater agreement with classical R 2 than is 
MIC, especially for relationships where R 2 is close to 0. When the marginal distribu- 
tion of a variable departs substantially from a normal distribution, A (like MIC) may 
produce more conservative estimates of association than classical R 2 (see SI1). This is 
because the null model for A makes less restrictive assumptions (only independence is 
assumed, without a parametric form), describing the data better than the null model for 
classical R 2 , which is a constant function with Gaussian errors. 

A converges faster with increasing sample size than MIC (figure^- bottom panels). 
For example, despite having a theoretical large-sample limit of 1 for a noiseless circle 



( |Reshef et al.||2011) , MIC « 0.74 when N = 10000. In contrast, A > 0.99 when 
N > 200. 

We also introduce a statistical test for non-independence associated with A, com- 
puting p-values through a randomization procedure (See SI4 for details). The A test 
has greater power to detect associations than MIC for all but one of the relationships we 
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Figure 3: Equitability and convergence of A. Top: For functions of 2 variables, 
A is approximately equitable, as demonstrated with 16 example functions (see SB 
for a list). Each function is marked with a different color. A (left) is closer to the 
classical R 2 than MIC (right), especially for associations near independence. Bottom: 
Estimates of association from A (left) and MIC (right) as sample size (N) increases 
for three different relationships: a noiseless circle (red), a bivariate normal distribution 
with expected R 2 = 0.5 (green), and independent noise (blue). MIC converges very 
slowly. 



tested, and outperforms Szekely's dCov test for association (Sze kely and Rizzo] |2009| l 
for all non-linear relationships tested. The A test was more comparable in performance 
to the recently proposed HHG test ( Heller et al. 201 2} , having greater power on 4 out 
of 7 tested relationships. See SI4 for further results. 

As shown in figure [4J A generalizes to multiple dimensions, producing equitable 
estimates very similar to classical R 2 for functions of two dimensions. It can assess 
the strength of association between vector valued variables, indicating what proportion 
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Figure 4: Equitability of A in higher dimensions. A against R 2 (left) for multivari- 
ate datasets generated by adding normally distributed noise to 16 different functions of 
two variables (right - see SI3 for more detail). 



of the variance in (X, Y) is explained by (A, B, C), for example. It also generalizes to 
more than two variables (with each variable being possibly vector valued), which could 
be used to discover lower dimensional manifolds embedded in a higher dimensional 
space (see SI5). 



As pointed out in Speed (201 1 1, an important question is how much of the variance 
in Y can be explained by X, after controlling for C. Here we introduce a non-linear 
analog of the semipartial correlation coefficient. We show (see SI6) that this agrees 
with the linear semipartial correlation when all relationships are linear. When the 
relationships are non-linear, however, the standard linear semipartial correlation can 
severely underestimate semipartial association, but it can also overestimate the semi- 
partial association between Y and X by ignoring a non-linear dependence of Y on the 
control variable C, returning values close to 1 when in fact Y is conditionally inde- 
pendent of X given C. Our non-linear semipartial association has no such difficulty, 
returning values close to for such cases. 

While this paper represents the initial practical contribution, further work remains 
to characterize the theoretical properties of A and A. A is clearly invariant to mono- 
tonic transformations of variables, but its estimate A is not, although it may be as N 
tends to infinity. Simulations suggest that A tends to wherever variables are inde- 
pendent, and 1 whenever a relation is noiseless and nowhere flat, but perhaps there are 
other circumstances under which 1 will be the large sample limit (MIC, for example, 
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can achieve 1 at large samples for noisy relationships - see SI7). Is the A test for in- 
dependence consistent against all alternatives, achieving a power of 100% as N tends 
to infinity whenever independence is violated in any way? A appears to be robust to 
outliers (see SI8), but is it possible to design outlier distributions that mislead it? A 
could also be improved by more sophisticated techniques to estimate the density ratio 



of the joint and independent distributions (Suzuki et al. 2009 1 Vincent and Bengio 



2002), which may improve the convergence of A for smaller sample sizes, but at a 
computational cost. 



Methods 

Consider two (possibly) vector valued variables, X and Y, with n observations {x! , . . . , 
and {yi, . . . , y n }. Each Xj itself may be a vector xf, . . . , x\, as may each y,. Further, 
imagine three kernel distributions, K x (x), Ky(y) and Kxy(< x, y >), where the 
kernels are symmetric, non-negative, and integrate to 1, and where angle brackets indi- 
cate vector concatenation. Our null model assumes that X and Y are independent, and 
so we define the leave-one-out cross validation likelihood as the product of marginal 
kernel density estimates: 



Lev {null) 



I I p (^\xvj^i)P(yt\yvj^i) 



n 



K x (*j - x. 



71-1 



><E 



K Y (yj - y» 

71—1 



The alternative model allows Y to depend on X for a proportion of points, w, with 
a leave-one-out cross validation likelihood defined as: 



Lev (alt) = ]J i w x P(xi>yi\xvjyti,yvjyti) + (1 - w)P(x; |x V j^)P(yi \yvjjn) 



n 

1=1 



k xy{< Xj - Xj, yj - y { >) + ^ _ ^ ^ K x (x.j - *i) ^ K Y (y 3 - y l 



71 — 1 



In our particular implementation, the values of each variable are replaced with 
their ranks (this is for computational convenience and should have little effect since 
A itself is invariant to order preserving transformations of variables), and the kernels 
are isotropic Gaussians, with K% and Ky sharing an 'independent' kernel variance 
parameter a], and K X y having a distinct 'dependent' variance parameter, <j 2 D . The 
null model thus has a single parameter, a], and the alternative model has 3 parame- 
ters: erf, erf}, and w. These parameters are optimized numerically to maximize the 
cross-validation likelihood, yielding A after employing equation 1. We found this esti- 
mate to be slightly biased down (relative to classical R 2 for bivariate Gaussians, where 
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A=classical R 2 ), especially for small samples, so we included an empirically-estimated 
small samples correction (see SI9). An R package named "matie" (Measuring Asso- 
ciation and Testing Independence Efficiently" - see SI10) for estimating A is available 
on CRAN (http : / / cran . r-pro ject . org/web/packages/matie/). Like 
MIC, estimating A is quadratic in the sample size, but with a much lower growth rate 
than MIC (see Sill). Supporting Information can be found at: |http : / /www . csT 
|sun . ac . za/ ~bmurrell/Murrell_Matie_SI . pdf | 
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